In [3]:
#imports
import pandas as pd
import numpy as np
In [4]:
raw_data = pd.read_csv('titanic_data.csv')
In [26]:
raw_data.head()
Out[26]:
In [9]:
raw_data.describe()
Out[9]:
However, describe() only summarises numeric columns, so it does not show us the NA values in non-numeric columns.
We move to another option:
In [12]:
raw_data.isnull().sum()
Out[12]:
In the Age column, 177 of the 891 rows are NaN, roughly 20%. If we replace these NaN with some other value, it should be a guard value that no real passenger has, so that it does not get mixed up with the rest of the values.
In the Cabin column, 687 of the 891 rows are null, an astounding 77%. Ignoring this column altogether makes more sense.
In the Embarked column only 2 values are NaN, so we could simply ignore those rows. We could also pick a replacement value and see how they behave.
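The percentages above can also be computed directly instead of by hand. The following is a minimal sketch on a made-up toy frame (the values are not from the Titanic file); it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; it only mimics the shape of the Titanic columns.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Cabin': [np.nan, np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to raw_data, the same call should give the share of missing values per column in one step.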
In [6]:
clean_data = raw_data.copy()
clean_data['Age'] = clean_data['Age'].fillna(-1)
In [7]:
clean_data.drop('Cabin', axis=1, inplace=True)
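One caveat with the -1 guard value: it keeps the rows, but it takes part in any statistic computed over Age unless we mask it first. The sketch below uses made-up ages to show the effect; the name known_only is just for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 40.0, np.nan, np.nan])  # made-up ages, two of them missing
filled = ages.fillna(-1)                        # same guard value as above

mean_ignoring_nan = ages.mean()     # NaN is skipped: (20 + 40) / 2
mean_with_sentinel = filled.mean()  # -1 is counted: (20 + 40 - 1 - 1) / 4

# Masking the guard value recovers the statistic over the known ages only.
known_only = filled[filled >= 0]
print(mean_ignoring_nan, mean_with_sentinel, known_only.mean())
```

For the same reason, the passengers with unknown age appear as a separate group at -1 in any chart grouped by Age.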
In [27]:
raw_data[raw_data['Embarked'].isnull()]
Out[27]:
It looks a bit strange that both passengers survived, shared the same cabin and the same ticket, and both lack their Embarked information.
Instead of deleting them we will leave the rows in for now.
These are configuration options for the charts.
In [5]:
%pylab inline
figsize(47,20)
We want to depict this data in ways that let us see where the survival rates are concentrated.
As a first step in the exploration, we are interested in how many people survived.
In [79]:
import matplotlib.pyplot as plt
survivors = clean_data.groupby('Survived').count()['Name']
plt.figure(figsize=(18,8))
colors = ['grey','cyan']  # one colour per outcome: died, survived
plt.pie(survivors, labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors)
plt.axis("equal")
plt.title("Titanic Survivors")
plt.show();
In [41]:
clean_data[clean_data['Survived'] == 1].groupby('Age').count().reset_index().plot(kind='bar',y='PassengerId', x='Age')
#pd.pivot_table(clean_data[clean_data['Survived'] == 1], index='Age', aggfunc=np.count_nonzero
Out[41]:
But this is not very helpful, since we don't see how many people there were in each age group. We can either plot survivors and non-survivors together, or calculate a survival ratio by age.
Let's see which helps us more.
In [84]:
#clean_data.groupby(['Age','Survived']).count().reset_index().plot(kind='bar',stacked = True, y='PassengerId', x='Age')
pivot_age = pd.pivot_table(clean_data, values='PassengerId', index='Age', columns='Survived', aggfunc=np.count_nonzero)
pivot_age.fillna(0).plot(kind='bar', stacked=True)  # stacked takes a boolean, not the string 'True'
Out[84]:
From what we can see, the raw counts do not tell us much about the effect of age, but let's analyse the survival ratio to be certain.
In [94]:
pivot_age = pivot_age.fillna(0)
pivot_age['survival_ratio'] = pivot_age[1] / (pivot_age[0] + pivot_age[1])
pivot_age.plot(kind = 'bar', y='survival_ratio')
Out[94]:
From this plot we can see that the highest survival ratios are for ages up to 9 and between 11 and 14. Some other age ranges also have good survival rates, such as 47 to 55.
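Ratios computed per exact age are noisy, because many ages have only a handful of passengers. One way to check ranges like the ones above is to group ages into bands with pd.cut. This is only a sketch: the data is made up and the band edges are a hypothetical choice, not something derived from the dataset.

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    'Age':      [2, 5, 8, 12, 13, 25, 30, 48, 52, 70],
    'Survived': [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
})

# Hypothetical band edges; right=False makes each band include its lower edge.
bins = [0, 10, 15, 47, 56, 120]
toy['age_band'] = pd.cut(toy['Age'], bins=bins, right=False)

# Survived is 0/1, so its mean per band is the survival ratio of that band.
# Ages outside the bins (such as a -1 guard value) get NaN and drop out here.
band_ratio = toy.groupby('age_band', observed=False)['Survived'].mean()
print(band_ratio)
```

With wider bands each bar rests on more passengers, which makes the ratios easier to compare.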
In [99]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=1, ncols=2)
survivors_male = clean_data[clean_data['Sex']=='male'].groupby('Survived').count()['Name']
survivors_female = clean_data[clean_data['Sex']=='female'].groupby('Survived').count()['Name']
colors = ['grey','cyan']
male_plot = survivors_male.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[0])
male_plot.axis("equal")
male_plot.set_title("Male Titanic Survivors")
female_plot = survivors_female.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[1])
female_plot.axis("equal")
female_plot.set_title("Female Titanic Survivors")
Out[99]:
As this representation clearly shows, a large share of the women survived: around 74%.
With this information alone we could already make a pretty good prediction.
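To make the idea of predicting from sex alone concrete, here is a minimal sketch of such a baseline: predict that every woman survives and every man dies, then measure how often that matches. The data is made up, so the resulting accuracy says nothing about the real dataset.

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 0, 1, 0],
})

# Baseline rule: predict 1 (survived) for females and 0 (died) for males.
predicted = (toy['Sex'] == 'female').astype(int)

# Accuracy is the fraction of rows where the prediction matches the outcome.
accuracy = (predicted == toy['Survived']).mean()
print(accuracy)
```

Running the same two lines on clean_data would give the accuracy of this rule on the actual passengers.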
In [103]:
survivors_male_age_pivot = clean_data[clean_data['Sex']=='male'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_male_age_pivot = survivors_male_age_pivot.fillna(0)['PassengerId']
survivors_male_age_pivot['survival_ratio'] = survivors_male_age_pivot[1]/(survivors_male_age_pivot[1]+survivors_male_age_pivot[0])
survivors_male_age_pivot.plot(kind='bar', y='survival_ratio')
Out[103]:
This representation clearly shows that males aged 0 to 6 are the ones who survived the most.
For females we want to study which ages died the most, since many more women survived.
In [104]:
survivors_female_age_pivot = clean_data[clean_data['Sex']=='female'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_female_age_pivot = survivors_female_age_pivot.fillna(0)['PassengerId']
survivors_female_age_pivot['dead_ratio'] = survivors_female_age_pivot[0]/(survivors_female_age_pivot[1]+survivors_female_age_pivot[0])
survivors_female_age_pivot.plot(kind='bar', y='dead_ratio')
Out[104]:
In [108]:
clean_data['Pclass'].head()
Out[108]:
We see that the data takes values ranging from 1 to 3, standing for 1st class (richer) through 3rd class (poorer).
In [109]:
survivors_first_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 1)
survivors_first_age_pivot.plot(kind='bar', y='survival_ratio')
Out[109]:
In [114]:
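def get_survival_ratio_pivot(source, attribute, value):
    # Count passengers per age and outcome for the rows where attribute == value
    pivot = source[source[attribute]==value].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
    pivot = pivot.fillna(0)['PassengerId']
    # Share of survivors (column 1) among all passengers of that age
    pivot['survival_ratio'] = pivot[1]/(pivot[1]+pivot[0])
    return pivot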
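The helper above assumes that the selected subset contains both outcomes; if nobody in it died (or nobody survived), the 0 or 1 column would be missing. A possible variant that guards against this is sketched below with pd.crosstab. The name survival_ratio_by_age and the toy data are hypothetical and only serve the illustration.

```python
import pandas as pd

def survival_ratio_by_age(source, attribute, value):
    """Counts of died (0) / survived (1) per age, plus the survival ratio."""
    subset = source[source[attribute] == value]
    counts = pd.crosstab(subset['Age'], subset['Survived'])
    # Make sure both outcome columns exist even if one never occurs in the subset.
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts['survival_ratio'] = counts[1] / (counts[0] + counts[1])
    return counts

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2],
    'Age':      [30, 30, 40, 5, 5],
    'Survived': [1, 0, 1, 1, 1],
})

first = survival_ratio_by_age(toy, 'Pclass', 1)
second = survival_ratio_by_age(toy, 'Pclass', 2)  # nobody died in this subset
print(first)
print(second)
```

The result has the same shape as the pivot returned above, so it can be plotted with the same plot(kind='bar', y='survival_ratio') call.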
In [115]:
survivors_second_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 2)
survivors_second_age_pivot.plot(kind='bar', y='survival_ratio')
Out[115]:
This distribution is more revealing. People in second class were only saved if they were extremely young. At this point it would be helpful to know how many people this represents.
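A ratio of 1.0 means little if it comes from one or two passengers, so it helps to show the group size next to the ratio. Here is a small sketch on made-up data; the column names passengers and survival_ratio are chosen only for the illustration.

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    'Age':      [1, 1, 30, 30, 30, 30],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# For each age: how many passengers there were and which share of them survived.
summary = toy.groupby('Age')['Survived'].agg(passengers='size', survival_ratio='mean')
print(summary)
```

Reading both columns together shows at a glance which ratios rest on very few passengers.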
In [126]:
survivors_second_age_pivot.columns = ['Died', 'Survived', 'Ratio']
# Select the two count columns by their new names for the stacked bars
ssap_plot = survivors_second_age_pivot.plot(kind='bar', stacked=True, y=['Died', 'Survived'])
#ssap_plot.set_label(['Died','Survived'])
In [127]:
survivors_third_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 3)
survivors_third_age_pivot.plot(kind='bar', y='survival_ratio')
Out[127]:
This distribution shows that just by being in 3rd class, your chances of surviving were a lot lower. Let's calculate how much lower.
In [130]:
survived_by_class = clean_data.pivot_table(index='Pclass', columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
survived_by_class['ratio'] = survived_by_class[1]/(survived_by_class[1]+survived_by_class[0])
survived_by_class
Out[130]:
In [11]:
from pivottablejs import pivot_ui
pivot_ui(clean_data)
Out[11]:
With the help of this tool we see that the most informative breakdown is by class and sex:
In [15]:
class_gender_pivot = pd.pivot_table(clean_data, index=['Pclass','Sex'],columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
class_gender_pivot['survival_ratio'] = class_gender_pivot[1]/(class_gender_pivot[1]+class_gender_pivot[0])
class_gender_pivot
Out[15]:
With this information we can say that a higher class meant better chances of living, especially for men, whose chances more than doubled. Women in the higher and middle classes mostly survived, and women in the lower class had exactly a 50% chance of surviving.
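Since Survived is coded as 0/1, the same class-by-sex ratios can also be obtained as a group mean, and a statement like "chances more than doubled" becomes a simple division. The sketch below uses made-up data, so its numbers are not the real Titanic ratios.

```python
import pandas as pd

# Made-up passengers, for illustration only.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column per group is the survival ratio; unstack puts Sex in columns.
ratio = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack('Sex')

# How many times higher the male survival ratio is in 1st class than in 3rd.
male_gain = ratio.loc[1, 'male'] / ratio.loc[3, 'male']
print(ratio)
print(male_gain)
```

On clean_data the ratio table should match the survival_ratio column computed above, laid out with one row per class.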
After analysing the data, we can state that: